perm filename CRASH.TXT[MUD,SYS]1 blob sn#553536 filedate 1981-01-04 generic text, type T, neo UTF8
On 23 Dec 80 at 2012, SU-AI suffered a disk head crash on system pack number
1, on drive C, which is one of the six system file structure disks.  As a
result of this head crash, which occurred on the Servo head, that entire disk
pack is unreadable and thus one sixth of the entire file system was lost.  (If
the crashed head had been any but the Servo head, then it would probably have
been possible to restore 18/19 of the crashed pack by running it with the
crashed head removed.  But it is impossible to run a pack without the Servo
head and surface, so the whole pack was lost, unfortunately.)

On the dead pack, there were 4965 files whose previous presence could be
detected from remaining file system data (including user directories (UFDs)
and actual pieces of partially lost files).  Of those 4965 files, 433 were
UFDs.  However, the MFD (master file directory, which lists all the UFDs)
escaped completely undamaged, and hence there were no user directories that
disappeared without leaving a trace in the file system (fortunately).

Unfortunately, there had not been a T-dump for 16 days prior to the crash.
This is significant because in a T-dump every file that has not already been
dumped is dumped (backed up onto tape).  At the moment of the crash, there was
a P-dump in progress which was 7/8 finished, and one previous P-dump had been
completed since the latest T-dump.  P-dumps, however, do not dump files less
than 4 days old.  Hence the P-dump that was in progress still had failed to
dump many recently changed files.  (Normally T-dumps are done 5 times a week
and P-dumps once a week, but recent confusion about the tape situation caused
a lack of communication and a failure to keep to the normal dumping schedule.
This was very unfortunate.)
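
The two dump rules described above can be sketched as follows.  This is only
an illustration of the rules as stated in this note, not DART's actual logic:
the function names and the example dates are hypothetical, and the P-dump
predicate models only the 4-day age exclusion (whatever other criteria a
P-dump applies are not covered here).

```python
from datetime import datetime, timedelta

# From the text: P-dumps do not dump files less than 4 days old.
P_DUMP_MIN_AGE = timedelta(days=4)

def t_dump_takes(already_dumped: bool) -> bool:
    """T-dump rule: every file that has not already been dumped is dumped."""
    return not already_dumped

def p_dump_takes(written: datetime, dump_time: datetime) -> bool:
    """P-dump age rule only: files less than 4 days old are skipped."""
    return dump_time - written >= P_DUMP_MIN_AGE

# Hypothetical files, measured against the time of the crash.
crash = datetime(1980, 12, 23, 20, 12)
recent_file = datetime(1980, 12, 21, 9, 0)   # written 2 days before the crash
older_file = datetime(1980, 12, 10, 9, 0)    # written 13 days before the crash

recent_covered = p_dump_takes(recent_file, crash)
older_covered = p_dump_takes(older_file, crash)
```

A file like recent_file is exactly the kind that only a T-dump would have
caught, which is why the 16-day gap in T-dumps mattered so much.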

Thanks to redundant retrieval information kept in the file system, we were
able to reconstruct the UFD entries for about 2/3 of the damaged files.  In
these cases, some (or all) of the data of each damaged file was lost, but the
directory information that was recovered included the name of the file, its
date/time written, and especially the number of the backup tape, if any, on
which this actual version (the latest version, that is) of the file was
dumped.  Using this tape information from the UFDs, we were able to restore
all files of this type (UFD entry OK) that had been dumped.  Except for about
1 single-bit tape error per thousand files, these restorations are
theoretically perfect (fortunately).

The UFDs that were damaged in the crash were reconstructed as much as possible
using redundant retrieval information from files that were at least partly
undamaged.  This left a certain class of files "missing without a trace".
Those are the files that existed entirely on the crashed pack and whose UFD
entries were on the crashed pack.  However, there is actually some information
left behind indicating the previous existence of these missing files; during
each dump, Dart makes a file that lists each file on the disk at that time
along with the number of the latest tape each file has been dumped on, if any.
Because the latest P-dump hadn't finished, Dart hadn't completed its list of
where all the then-current files live on tape; so we augmented that list with
the corresponding list from the end of the previous P-dump (a complete one).
Users whose programmer names occur at or after STT alphabetically and who had
a UFD damaged in the crash may have lost some files without our knowing it;
those would be new files that didn't exist at the time of the last complete
P-dump.  Users with damaged UFDs will have to (1) figure out what files of
theirs may have disappeared completely and (2) reconstruct such files
themselves (unfortunately).

Besides tape backup of disk files, we have a backup disk copy of the entire
disk, made on 6 April 1980.  We used the April disks to get back about 3120
files (saving us the trouble of restoring most of those same files from about
500 different old P-tapes).  The April disks were also used to get a system up
and running after the system DMP file and some system sources were damaged in
the crash (which didn't discriminate against anyone).  It was fortunate that
we had the disk copy.

UFDs that were entirely lost (although later reconstructed) have lost
their passwords.  Users needing help putting their passwords back should
contact ME or REG.


Any damaged file that (1) was never dumped in any version and (2) did not
exist in the 6 April 1980 backup disk copy is now lost for good.  Users will
have to reconstruct such files themselves.  We have a list of the names of
such known irrecoverable files (472 of them, see list descriptions below).
Note: Some of these irrecoverable files were only partially damaged, and all
of them still exist on the disk, simply with one or more whole tracks' worth
of data replaced with zeroes.

All files that were damaged and that could be restored from Dart tapes or the
April disk have been restored.  This includes two kinds of files:  perfect
restorations, in which absolutely the latest version of each file has been
restored; and imperfect restorations, in which each file has been restored to
an earlier version because the latest version is not available anywhere.  For
any imperfect restoration whose original file was only partly damaged, the
partly intact carcass of the damaged file is available for merging with any
earlier version that has been restored (the file PARTIA.LOS[MUD,SYS] lists all
the files that were only partially damaged).  For instance, in the case of a
message file, the carcass might include many of the most recently received
messages (there are 43 such message-file carcasses); the restored file will
contain everything prior to some date, and possibly the entire original file
can be reconstructed by combining the two versions.  These carcasses have been
copied to tape before any previous versions were restored; hence the carcasses
are available to users who want to try to use them.  All of these files have
now been put on the disk, on the area [PAR,SYS], under their original
filenames (less the PPN) except for files originally called MSG.MSG or
OUTGO.MSG, which are now called MSG.PRG and OUTGO.PRG, where PRG is the
original programmer name of the file's PPN.  Each carcass is a copy of the
original file but with one or more whole tracks of the original file replaced
by tracks containing all zeroes.  Each actually restored file version has the
exact name, extension and PPN of the original (damaged) file (thus the
carcasses had to be moved somewhere else).
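
For anyone trying to merge a carcass with a restored earlier version, a first
step is to find which tracks of the carcass were zeroed.  The following is a
minimal sketch, under assumptions: the carcass is modeled as a plain byte
string (the real files are made of 36-bit words), the function name is made
up for illustration, and track_size is a hypothetical parameter, since the
actual track size is not given in this note.  A track that legitimately held
nothing but zeroes in the original file would be flagged too.

```python
def zeroed_tracks(data: bytes, track_size: int) -> list[int]:
    """Return the indices of track-sized chunks that contain only zero bytes.

    track_size is an assumed unit; the last chunk may be shorter than a track.
    """
    if track_size <= 0:
        raise ValueError("track_size must be positive")
    zeroed = []
    for start in range(0, len(data), track_size):
        chunk = data[start:start + track_size]
        # A chunk is "zeroed" when every byte in it is 0.
        if chunk.count(0) == len(chunk):
            zeroed.append(start // track_size)
    return zeroed

# Toy carcass: track 0 intact, track 1 zeroed, short final track intact.
carcass = b"abcd" + bytes(4) + b"ef"
lost = zeroed_tracks(carcass, track_size=4)
```

The tracks not in the returned list are the intact parts worth comparing
against the restored version.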

When you are finished using (copying, etc.) the partially OK file versions of
your files that are on [PAR,SYS], please MAIL a note to that effect to ME, so
that the files on [PAR,SYS] can be removed and the disk space freed up.  The
files there will probably not be kept beyond the end of January 1981, but they
will be available on tape thereafter.

For any damaged file whose UFD entry was lost, it is unknown whether the
version restored represents a perfect or an imperfect restoration.  Also,
some files have been restored that had recently been deleted (e.g.,
reaped); such files were restored only on UFDs that were damaged (thus
obscuring which files existed at the time of the crash).  Please check
your areas for such files.  (Note that all files restored, including those
NOT previously deleted, came back written by DMP,SYS using DART.)


Here are some useful files listing particular information about damaged files.

GONE.ALL[MUD,SYS]  (455 files)
	Known files that could not be restored in any version (never dumped)
	and which were totally destroyed in the crash.  These files now
	contain nothing but zeroes (but the right number of zeroes!).
GONE.PAR[MUD,SYS]  (16 files)
	Files only partially damaged that could not be restored in any version.
	Each file has one or more tracks intact.
PARTIA.LOS[MUD,SYS]  (167 files)
	Files only partially damaged in the crash.  Most of these files have
	been restored from the latest or earlier versions.  Each file has
	one or more tracks intact, but the partially damaged version now
	lives on [PAR,SYS] (to make room for the restored version).  If you
	have files in this list, see the main text just above for how to
	find them on [PAR,SYS].
DART.LST[MUD,SYS]  (2919 files)
	Recently existing files whose data and UFD entry were lost in the
	crash.  Filenames not followed by a tape number could not possibly
	be perfectly restored.  Filenames with a tape number were restored
	but it is unknown whether the restorations are perfect or not
	because the UFD entries for these files were all lost.
CATLOG.NAM[MUD,SYS]  (4965 files)
	All files known to have been on the crashed pack whose UFD entries
	were NOT lost.  For each of these files, it is known whether the
	restoration was perfect or not: if there is a tape number with a
	given file's entry in this list, then the file was restored
	perfectly; if the tape field says "never", then any restoration
	of the file was imperfect ("never" applies only to the latest
	version never having been dumped).  This list is sorted by PPN.
	The beginning entries in this file indicate WHICH UFDs were
	damaged.  Damaged UFDs may have lost some recent files without our
	knowledge.
CATLOG.TAP[MUD,SYS]  (4965 files)
	Just like CATLOG.NAM but sorted by tape number.  Files whose tape
	number is missing (entry says "never") could not possibly be
	perfectly restored.
MRESTO.DUN[MUD,SYS]  (6896 files)
	Master list of files that were restored from tape (or from old disk).


**  Misc notes  **

About 200 tapes were used (in addition to the backup disk packs) to restore
files.  It took about 42 hours to read all those tapes.

The remind queue has been rolled back to its state as of 1421 on 7 Dec 80.
Don't be surprised if you get some old reminders again!

The PUMPKIN list of files to be restored was rolled back to its state on 7
Dec 80.  If shortly before the crash you made a PUMPKIN request that
wasn't carried out, you can request it again.

All NS stories are gone; all *.NAP[2,2] files (NS notification lists) have
been deleted, since they would at best have pointed to stories that no longer
existed.

Various program improvements were made during the recovery.  Numerous
features to aid in the recovery were added to DART and to the disk auditing
program RALPH.  Several bugs were fixed in DART w.r.t. scanning a command
file, and DART was made more robust with respect to tape errors occurring on
the tape in files not being restored (this was a bug fix that makes all
restorations from imperfect tapes simpler).  COPY can now handle up to about
10,000 files in an indirect file (instead of 175).  Some 50 text pages of code
were written and used in the recovery.

Also, several pieces of hardware got fixed.  Disk drive F, down for about two
years, was finally fixed, using two cards from the crashed drive, so we still
have a UDP drive despite one drive being down because of the crashed head.
When the crashed drive is repaired (and two new replacement cards arrive), we
should be able to put up two UDP drives.  Disk drive A had a fan replaced that
was causing seek errors from an overheated power supply; these seek errors
were what caused us to move the system disk pack onto the drive where it
shortly crashed.  The second disk controller, down since the move to Jacks
Hall, has been fixed (by Jeff Rubin making a guest appearance) and seems to
work fine now, although it is not yet connected to any of the disks.  Magtape
drive B now rewinds again, although it still has some other problems (at least
we were able to use it for file restoration).

This was (I believe) the first head crash ever of an SU-AI removable system
file pack (prior to removable packs, the Librascope disk had suffered
several crashes).  Some UDPs had crashed before.

A new version of WAITS was created that uses only one system file pack
(instead of six), and a pack was structured and used with this WAITS version.
This pack can now be used either as a UDP or as the system file pack in this
one-pack system.

A disk pack copy for backup had been planned for what turned out to be
the week after the crash, one week too late.  Also, a disk washing and
head cleaning had already been planned for the near future, although it
might or might not have prevented the crash.  By mistake, the latest
system sources were not dumped after the last new system version was put
up prior to the crash.  So the latest system had to be reconstructed.

The disk purge that had been scheduled never happened.  The disk
crashed as the selection phase of the purge was running.  A few people
got notices that their files were purged, but these notices should be ignored as
no files had actually been deleted before the crash.

Thus endeth the 8th day of captivity for the Stanford recovery team.

Merry Head Crash and a Happy New Disk!

-- ME (for ME & REG, with help from a handful of others) (31 Dec 80, 2345)


*** Late notes: (2 Jan 81, 2230) ***

Some UFDs that were damaged lost their Group Access Bits (e.g., MAS).  For
most UFDs, the group access bits were restored correctly, but for UFDs
that never got logged in on after the access bits were established, the
reconstructed UFDs probably have the access bits cleared.  The owners of
such directories will have to re-establish the desired access bits
themselves, when they notice that they can no longer access directories
that they used to be able to.

Also, more files have been restored from tape.  All the files in the
latest batch were restored from the P-dump that was in progress when the
crash occurred.  (The LOCATE data for these tapes was formerly not
accessible to Dart, since the P-dump hadn't yet finished; it now has
finished.)  There are two kinds of files restored in this group: (1) files
that were not restored at all previously (because we didn't know they had
existed) and (2) files that previously were restored to an older version
than this P-dump contained.  NO FILES OF THIS SECOND TYPE WERE RESTORED IF
THE CURRENT VERSION ON DISK AT THIS TIME HAD BEEN WRITTEN SINCE THE
ORIGINAL RESTORATION.  This means that some restorations could possibly be
improved upon even now, but it will be up to individual users to figure
out how to merge their recent changes with a less ancient restorable
version.  Note that only versions on tapes P1830 through P1836 contain
these "less ancient" versions.  Later P-dumps probably contain re-dumped
earlier versions, or altered copies thereof!  
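
The rule in capitals above amounts to a single comparison, sketched here.
This only illustrates the rule as stated; the function name and the dates
are hypothetical, and it is an assumption that the check was made by
comparing a file's date/time written against the time of its original
restoration.

```python
from datetime import datetime

def may_improve(current_written: datetime, original_restoration: datetime) -> bool:
    """True if the file on disk has not been written since it was first
    restored, so replacing it with the newer tape version clobbers nothing."""
    return current_written <= original_restoration

# Hypothetical example: both files were first restored on 29 Dec 80 at 1200.
restored_at = datetime(1980, 12, 29, 12, 0)
untouched = may_improve(datetime(1980, 12, 29, 12, 0), restored_at)
edited_since = may_improve(datetime(1981, 1, 1, 15, 30), restored_at)
```

Files in the second situation were left alone and are the ones whose owners
must do any merging themselves.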

Here are the lists of the files that were discovered in this latest batch.
(Note that the tapes used here, P1830 through P1836, contain NO FILES for
programmer names after STT, except for SYS.)

FOUND.RES[MUD,SYS]  (412 files)
	Files restored from P1830 through P1836 that didn't exist at the
	time this list was made, minus those files that had recently been
	reaped and minus files on UFDs that had been deleted by their
	owners since the crash ([EX,HJL] and [MUS,RDG]).  Most of these
	files were files that we previously didn't even know had existed,
	but they have now been restored.  It is unknown whether these
	restorations are perfect or not, but they are from file versions
	written within a week of the crash.

BETTER.PDU[MUD,SYS]  (40 files)
	Files now restored from P1830 through P1836 which were previously
	restored to older versions.  Again note that some files that could
	have had such better restorations made had already been written
	since the crash, so those file restorations were not improved.
	See the next list for such files.

CHECK.LOC[MUD,SYS]  (11 files)
	Files that could probably have been restored to more recent
	versions than the ones they were restored to (if any).  The
	improved restorations would be from P1830 through P1836 (only!).
	These original restored versions had been modified since the
	crash, so the improved restorations could not be made without
	clobbering recent changes.  It is now up to the owners of these
	files to see if they want to try to merge their current versions
	with the ones on tapes P1830 through P1836.  For each of these
	files, this list indicates the tape numbers and dates written for
	the best restoration and the actual (older version) restoration.
	TO AID IN USERS' USING THESE "BEST" VERSIONS, WE HAVE PUT THE
	"BEST" VERSIONS ON [MUD,SYS] FROM WHENCE USERS CAN MERGE THEM INTO
	THEIR CURRENT VERSIONS.  Please MAIL a message to ME when you are
	through using one of these files of yours from [MUD,SYS].


This is the last phase of the restorations.  Unless something unexpected
is discovered concerning new and exciting missing old data, remaining file
puzzles will be left to the reader as an exercise.  Good luck.

-- ME (2 Jan 81, 2230)